Abstract

We show that the Brier game of prediction is mixable and find the optimal
learning rate and substitution function for it. The resulting prediction
algorithm is applied to predict results of football and tennis matches. The
theoretical performance guarantee turns out to be rather tight on these
data sets, especially in the case of the more extensive tennis data.

1 Introduction

The paradigm of prediction with expert advice was introduced in the late 1980s
(see, e.g., [5], [11], [2]) and has been applied to various loss functions; see [3] for a
recent book-length review. An especially important class of loss functions is that
of "mixable" ones, for which the learner's loss can be made as small as the best
expert's loss plus a constant (depending on the number of experts). It is known
[8, 14] that the optimal additive constant is attained by the "strong aggregating
algorithm" proposed in [13] (we use the adjective "strong" to distinguish it from
the "weak aggregating algorithm" of [9]).

There are several important loss functions that have been shown to be mixable
and for which the optimal additive constant has been found. The prime
examples in the case of binary observations are the log loss function and the
square loss function. The log loss function, whose mixability is obvious, has been
explored extensively, along with its important generalizations, the
Kullback-Leibler divergence and Cover's loss function.

In this paper we concentrate on the square loss function. In the binary case, 
its mixability was demonstrated in [13]. There are two natural directions in 
which this result could be generalized: 

Regression: observations are real numbers (square-loss regression is a standard
problem in statistics).

Classification: observations take values in a finite set (this leads to the "Brier
game", to be defined below, a standard way of measuring the quality of
predictions in meteorology and other applied fields: see, e.g., [4]).

The mixability of the square loss function in the case of observations belonging
to a bounded interval of real numbers was demonstrated in [8]; Haussler et al.'s
algorithm was simplified in [16]. Surprisingly, the case of square-loss non-binary
classification has never been analysed in the framework of prediction with expert
advice. The purpose of this paper is to fill this gap. Its short conference version
[17] appeared in the ICML 2008 proceedings.

2 Prediction algorithm and loss bound 

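A game of prediction consists of three components: the observation space $\Omega$,
the decision space $\Gamma$, and the loss function $\lambda : \Omega \times \Gamma \to \mathbb{R}$. In this paper we
are interested in the following Brier game [1]: $\Omega$ is a finite and non-empty set,
$\Gamma := \mathcal{P}(\Omega)$ is the set of all probability measures on $\Omega$, and

$$\lambda(\omega, \gamma) = \sum_{o \in \Omega} \left( \gamma\{o\} - \delta_\omega\{o\} \right)^2,$$

where $\delta_\omega \in \mathcal{P}(\Omega)$ is the probability measure concentrated at $\omega$: $\delta_\omega\{\omega\} = 1$
and $\delta_\omega\{o\} = 0$ for $o \ne \omega$. (For example, if $\Omega = \{1, 2, 3\}$, $\omega = 1$, $\gamma\{1\} = 1/2$,
$\gamma\{2\} = 1/4$, and $\gamma\{3\} = 1/4$, then $\lambda(\omega, \gamma) = (1/2 - 1)^2 + (1/4 - 0)^2 + (1/4 - 0)^2 = 3/8$.)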
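
The worked example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `brier_loss` and the representation of a forecast as a dictionary from outcomes to probabilities are assumptions made here, not notation from the paper. Exact fractions are used so that the value 3/8 is recovered without rounding.

```python
from fractions import Fraction


def brier_loss(outcome, forecast):
    """Brier loss of `forecast` (dict: outcome -> probability) when `outcome` occurs.

    Implements lambda(omega, gamma) = sum_o (gamma{o} - delta_omega{o})^2.
    """
    return sum((p - (1 if o == outcome else 0)) ** 2 for o, p in forecast.items())


# The example from the text: Omega = {1, 2, 3}, omega = 1.
forecast = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
print(brier_loss(1, forecast))  # prints 3/8
```

With the same forecast, the loss for observation 2 is larger, since less probability was assigned to that outcome.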

The game of prediction is being played repeatedly by a learner having access 
to decisions made by a pool of experts, which leads to the following prediction 
protocol: 



Protocol 1 Prediction with expert advice
  $L_0 := 0$.
  $L_0^k := 0$, $k = 1, \ldots, K$.
  for $N = 1, 2, \ldots$ do
    Expert $k$ announces $\gamma_N^k \in \Gamma$, $k = 1, \ldots, K$.
    Learner announces $\gamma_N \in \Gamma$.
    Reality announces $\omega_N \in \Omega$.
    $L_N := L_{N-1} + \lambda(\omega_N, \gamma_N)$.
    $L_N^k := L_{N-1}^k + \lambda(\omega_N, \gamma_N^k)$, $k = 1, \ldots, K$.
  end for



At each step of Protocol 1 Learner is given $K$ experts' advice and is required
to come up with his own decision; $L_N$ is his cumulative loss over the first $N$
steps, and $L_N^k$ is the $k$th expert's cumulative loss over the first $N$ steps. In
the case of the Brier game, the decisions are probability forecasts for the next
observation.

An optimal (in the sense of Theorem 1 below) strategy for Learner in
prediction with expert advice for the Brier game is given by the strong aggregating
algorithm. For each expert $k$, the algorithm maintains its weight $w^k$, constantly
slashing the weights of less successful experts. Its description uses the notation
$t^+ := \max(t, 0)$.



Algorithm 1 Strong aggregating algorithm for the Brier game
  $w_0^k := 1$, $k = 1, \ldots, K$.
  for $N = 1, 2, \ldots$ do
    Read the Experts' predictions $\gamma_N^k$, $k = 1, \ldots, K$.
    Set $G_N(\omega) := -\ln \sum_{k=1}^K w_{N-1}^k e^{-\lambda(\omega, \gamma_N^k)}$, $\omega \in \Omega$.
    Solve $\sum_{\omega \in \Omega} (s - G_N(\omega))^+ = 2$ in $s \in \mathbb{R}$.
    Set $\gamma_N\{\omega\} := (s - G_N(\omega))^+ / 2$, $\omega \in \Omega$.
    Output prediction $\gamma_N \in \mathcal{P}(\Omega)$.
    Read observation $\omega_N$.
    $w_N^k := w_{N-1}^k e^{-\lambda(\omega_N, \gamma_N^k)}$, $k = 1, \ldots, K$.
  end for
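
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: outcomes are encoded as indices 0, ..., n-1, forecasts are lists of probabilities, the equation for $s$ is solved by bisection, the weights are kept in logarithmic form for numerical stability, and the multiplicative weight update is assumed to be the standard one for learning rate 1. All function names are hypothetical.

```python
import math


def brier_loss(outcome, forecast):
    """Brier loss of a forecast (list of probabilities) for the observed outcome index."""
    return sum((p - (1.0 if o == outcome else 0.0)) ** 2 for o, p in enumerate(forecast))


def substitute(G):
    """Solve sum_w (s - G[w])^+ = 2 for s by bisection; return the forecast (s - G[w])^+ / 2."""
    lo, hi = min(G), max(G) + 2.0  # the left-hand side is 0 at lo and at least 2 at hi
    for _ in range(100):
        s = (lo + hi) / 2
        if sum(max(s - g, 0.0) for g in G) < 2.0:
            lo = s
        else:
            hi = s
    s = (lo + hi) / 2
    return [max(s - g, 0.0) / 2 for g in G]


def strong_aggregating_algorithm(expert_forecasts, outcomes):
    """Run the algorithm over a sequence of steps.

    expert_forecasts[N][k] is expert k's forecast at step N; outcomes[N] is the
    observed outcome index. Returns (Learner's cumulative loss, experts' cumulative losses).
    """
    K = len(expert_forecasts[0])
    log_w = [0.0] * K  # logarithms of the weights; all weights start at 1
    learner_loss, expert_loss = 0.0, [0.0] * K
    for forecasts, outcome in zip(expert_forecasts, outcomes):
        n = len(forecasts[0])
        G = []
        for w in range(n):
            # G(w) = -ln sum_k weight_k * exp(-loss of expert k if w happens), via log-sum-exp
            terms = [log_w[k] - brier_loss(w, forecasts[k]) for k in range(K)]
            m = max(terms)
            G.append(-(m + math.log(sum(math.exp(t - m) for t in terms))))
        prediction = substitute(G)
        learner_loss += brier_loss(outcome, prediction)
        for k in range(K):
            loss = brier_loss(outcome, forecasts[k])
            expert_loss[k] += loss
            log_w[k] -= loss  # multiply the weight by exp(-loss)
    return learner_loss, expert_loss
```

With a single expert the sketch simply reproduces that expert's forecast, which gives a quick sanity check of the substitution step.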



The algorithm will be derived in Section 5. The following result (to be proved 
in Section 4) gives a performance guarantee for it that cannot be improved by 
any other prediction algorithm. 

Theorem 1. Using Algorithm 1 as Learner's strategy in Protocol 1 for the Brier
game guarantees that

$$L_N \le \min_{k=1,\ldots,K} L_N^k + \ln K \tag{1}$$

for all $N = 1, 2, \ldots$. If $A < \ln K$, Learner does not have a strategy guaranteeing

$$L_N \le \min_{k=1,\ldots,K} L_N^k + A \tag{2}$$

for all $N = 1, 2, \ldots$.

The second part of this theorem follows from its special case with $|\Omega| = 2$ (the
binary case). However, we are not aware of a proof of this result in the binary
case, and we will not use this reduction.

3 Experimental results 

In our first empirical study of Algorithm 1 we use historical data about 6473
matches in various English football league competitions, namely: the Premier
League (the pinnacle of the English football system), the Football League
Championship, Football League One, Football League Two, and the Football
Conference. Our data, provided by Football-Data, cover three seasons, 2005/2006,
2006/2007, and 2007/2008. (The 2007/2008 season ended in May shortly after
the ICML 2008 submission deadline, and so the data set used in the conference
version [17] of this paper covered only part of that season, with 6416 matches
in total.) The matches are sorted first by date, then by league, and then by
the name of the home team. In the terminology of our prediction protocol, the
outcome of each match is the observation, taking one of three possible values,
"home win", "draw", or "away win"; we will encode the possible values as 1, 2,
and 3.

For each match we have forecasts made by a range of bookmakers. We chose 
eight bookmakers for which we have enough data over a long period of time, 
namely Bet365, Bet&Win, Gamebookers, Interwetten, Ladbrokes, Sportingbet, 
Stan James, and VC Bet. (And the seasons mentioned above were chosen 
because the forecasts of these bookmakers are available for them.) 

A probability forecast for the next observation is essentially a vector
$(p_1, p_2, p_3)$ consisting of positive numbers summing to 1. The bookmakers do
not announce these numbers directly; instead, they quote three betting odds,
$a_1$, $a_2$, and $a_3$. Each number $a_i$ is the amount which the bookmaker undertakes
to pay out to a client betting on outcome $i$ per unit stake in the event that $i$
happens (the stake itself is never returned to the bettor, which makes all betting
odds greater than 1; i.e., the odds are announced according to the "continental"
rather than "traditional" system). The inverse value $1/a_i$, $i \in \{1, 2, 3\}$, can be
interpreted as the bookmaker's quoted probability for the observation $i$. The
bookmaker's quoted probabilities are usually slightly (because of the competition
with other bookmakers) in his favour: the sum $1/a_1 + 1/a_2 + 1/a_3$ exceeds
1 by the amount called the overround (at most 0.15 in the vast majority of
cases). We used

$$p_i := \frac{1/a_i}{1/a_1 + 1/a_2 + 1/a_3}, \quad i = 1, 2, 3, \tag{3}$$

as the bookmaker's forecasts; it is clear that $p_1 + p_2 + p_3 = 1$.
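
The conversion (3) from betting odds to probability forecasts, together with the overround, can be sketched as below. The function name `odds_to_forecast` and the odds in the usage line are hypothetical and chosen only for illustration; they are not taken from the data sets used in the paper.

```python
def odds_to_forecast(odds):
    """Normalize inverse odds as in (3); return (forecast, overround).

    `odds` are "continental" odds, each greater than 1.
    """
    inverse = [1.0 / a for a in odds]
    total = sum(inverse)  # equals 1 + overround
    return [q / total for q in inverse], total - 1.0


# Hypothetical odds for home win, draw, away win.
probs, overround = odds_to_forecast([2.0, 3.2, 3.6])
```

For these illustrative odds the overround is roughly 0.09, which falls in the range the text describes as typical.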

The results of applying Algorithm 1 to the football data, with 8 experts and
3 possible observations, are shown in Figure 1. Let $L_N^k$ be the cumulative loss of
Expert $k$, $k = 1, \ldots, 8$, over the first $N$ matches and $L_N$ be the corresponding
number for Algorithm 1 (i.e., we essentially continue to use the notation of
Theorem 1). The dashed line corresponding to Expert $k$ shows the excess loss
$N \mapsto L_N^k - L_N$ of Expert $k$ over Algorithm 1. The excess loss can be negative,
but from Theorem 1 we know that it cannot be less than $-\ln 8$; this lower bound
is also shown in Figure 1. Finally, the thick line (the positive part of the x axis)
is drawn for comparison: this is the excess loss of Algorithm 1 over itself. We
can see that at each moment in time the algorithm's cumulative loss is fairly
close to the cumulative loss of the best expert (at that time; the best expert
keeps changing over time).
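
The excess-loss curves described above are simple to compute from per-step losses. The sketch below assumes the per-step losses are already available as lists; the names `excess_loss_curves` and `respects_bound` are hypothetical, and the small tolerance in the bound check is an assumption to absorb floating-point rounding.

```python
import math
from itertools import accumulate


def excess_loss_curves(expert_step_losses, learner_step_losses):
    """Return, for each expert k, the list of L_N^k - L_N for N = 1, 2, ...

    expert_step_losses[k][N] is the loss suffered by expert k at step N.
    """
    learner_cum = list(accumulate(learner_step_losses))
    return [
        [e - l for e, l in zip(accumulate(losses), learner_cum)]
        for losses in expert_step_losses
    ]


def respects_bound(curves):
    """Check that no excess loss falls below -ln K, where K is the number of experts."""
    lower = -math.log(len(curves))
    return all(v >= lower - 1e-9 for curve in curves for v in curve)
```

Plotting each returned list against N gives dashed lines of the kind described for Figure 1.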

Figure 2 shows the distribution of the bookmakers' overrounds. We can
see that in most cases overrounds are between 0.05 and 0.15, but there are
also occasional extreme values, near zero or in excess of 0.3. In Figure 1 one
bookmaker clearly performs worse than the others. His poor performance may
be explained by his mean overround being about 0.13, near the top end of the
distribution in Figure 2. (On one hand, a high overround diminishes the need
for accurate probability forecasts, and on the other, our estimates (3) of the
probabilities implicit in the announced odds also become less precise.)

Figure 1: The difference between the cumulative loss of each of the 8 bookmakers
(experts) and of Algorithm 1 on the football data. The theoretical lower bound
$-\ln 8$ from Theorem 1 is also shown.



Figure 3 shows the results of another empirical study, involving data about
a large number of tennis tournaments in 2004, 2005, 2006, and 2007, with the
total number of matches 10,087. The tournaments include, e.g., Australian
Open, French Open, US Open, and Wimbledon; the data is provided by
Tennis-Data. The matches are sorted by date, then by tournament, and then by the
winner's name. The data contain information about the winner of each match
and the betting odds of 4 bookmakers for his/her win and for the opponent's
win. Therefore, now there are two possible observations (player 1's win and
player 2's win). There are four bookmakers: Bet365, Centrebet, Expekt, and
Pinnacle Sports. The results in Figure 3 are presented in the same way as in
Figure 1.

Typical values of the overround are below 0.1, as shown in Figure 4 (analo- 
gous to Figure 2). 

In both Figure 1 and Figure 3 the cumulative loss of Algorithm 1 is close to
the cumulative loss of the best expert, despite the fact that some of the experts
perform poorly. The theoretical bound is not hopelessly loose for the football
data and is rather tight for the tennis data. The pictures look exactly the same
when Algorithm 1 is applied in the more realistic manner where the experts'
weights $w^k$ are not updated over the matches that are played simultaneously.

Our second empirical study (Figure 3) is about binary prediction, and so
the algorithm of [13] could have also been used (and would have given similar
results). We included it since we are not aware of any empirical studies even for
the binary case.

Figure 2: The overround distribution histogram for the football data, with 200
bins of equal size between the minimum and maximum values of the overround.

For comparison with several other popular prediction algorithms, see Ap- 
pendix B. The data used for producing all the figures and tables in this section 
and in Appendix B can be downloaded from http://vovk.net/ICML2008. 

4 Proof of Theorem 1 

This proof will use some basic notions of elementary differential geometry, es- 
pecially those connected with the Gauss-Kronecker curvature of surfaces. (The 
use of curvature in this kind of results is standard: see, e.g., [13] and [8].) All 
definitions that we will need can be found in, e.g., [12]. 

A vector $f \in \mathbb{R}^\Omega$ (understood to be a function $f : \Omega \to \mathbb{R}$) is a superprediction
if there is $\gamma \in \Gamma$ such that, for all $\omega \in \Omega$, $\lambda(\omega, \gamma) \le f(\omega)$; the set $\Sigma$ of all
superpredictions is the superprediction set. For each learning rate $\eta > 0$, let
$\Phi_\eta : \mathbb{R}^\Omega \to (0, \infty)^\Omega$ be the homeomorphism defined by

$$\Phi_\eta(f) : \omega \in \Omega \mapsto e^{-\eta f(\omega)}, \quad f \in \mathbb{R}^\Omega. \tag{4}$$

The image $\Phi_\eta(\Sigma)$ of the superprediction set will be called the $\eta$-exponential
superprediction set. It is known that

$$L_N \le \min_{k=1,\ldots,K} L_N^k + \frac{\ln K}{\eta}, \quad N = 1, 2, \ldots,$$

can be guaranteed if and only if the $\eta$-exponential superprediction set is convex
(part "if" for all $K$ and part "only if" for $K \to \infty$ are proved in [14]; part
"only if" for all $K$ is proved by Chris Watkins, and the details can be found in
Appendix A). Comparing this with (1) and (2) we can see that we are required
to prove that

• $\Phi_\eta(\Sigma)$ is convex when $\eta \le 1$;

• $\Phi_\eta(\Sigma)$ is not convex when $\eta > 1$.

Figure 3: The difference between the cumulative loss of each of the 4 bookmakers
and of Algorithm 1 on the tennis data. Now the theoretical bound is $-\ln 4$.
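
The two claims to be proved can be probed numerically in the binary case, where a forecast is a single number $p$ (the probability of the first outcome) and the Brier losses are $2(1-p)^2$ and $2p^2$. The sketch below is only a finite grid check, not a proof, and all names in it are hypothetical: it tests whether every exponential mixture of two forecasts is dominated by a single forecast, which is what convexity of the exponential superprediction set amounts to for two-point mixtures.

```python
import math


def binary_brier_losses(p):
    """Losses (if outcome 1 happens, if outcome 2 happens) of the forecast (p, 1 - p)."""
    return 2 * (1 - p) ** 2, 2 * p ** 2


def mixture_is_superprediction(p1, p2, alpha, eta):
    """Is the eta-exponential mixture of forecasts p1 and p2 dominated by one forecast?"""
    g = []
    for l1, l2 in zip(binary_brier_losses(p1), binary_brier_losses(p2)):
        mix = alpha * math.exp(-eta * l1) + (1 - alpha) * math.exp(-eta * l2)
        g.append(max(0.0, -math.log(mix) / eta))
    # A forecast p with 2(1-p)^2 <= g[0] and 2p^2 <= g[1] exists if and only if
    # the intervals [1 - sqrt(g[0]/2), 1] and [0, sqrt(g[1]/2)] intersect.
    return math.sqrt(g[0] / 2) + math.sqrt(g[1] / 2) >= 1 - 1e-12


def looks_mixable(eta, grid=21):
    """Grid check of the mixability condition for the binary Brier game."""
    pts = [i / (grid - 1) for i in range(grid)]
    return all(
        mixture_is_superprediction(p1, p2, a, eta)
        for p1 in pts for p2 in pts for a in pts
    )
```

On this grid the check passes for learning rates up to 1 and fails for a learning rate such as 1.5, for instance at the pair of forecasts 0.45 and 0.55 mixed with equal weights.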

Define the $\eta$-exponential superprediction surface to be the part of the boundary
of the $\eta$-exponential superprediction set $\Phi_\eta(\Sigma)$ lying inside $(0, \infty)^\Omega$. The
idea of the proof is to check that, for all $\eta < 1$, the Gauss-Kronecker curvature
of this surface is nowhere vanishing. Even when this is done, however, there
is still uncertainty as to in which direction the surface is bulging (towards the
origin or away from it). The standard argument (as in [12], Chapter 12,
Theorem 6) based on the continuity of the smallest principal curvature shows that
the $\eta$-exponential superprediction set is bulging away from the origin for small
enough $\eta$: indeed, since it is true at some point, it is true everywhere on the
surface. By the continuity in $\eta$ this is also true for all $\eta < 1$. Now, since the
$\eta$-exponential superprediction set is convex for all $\eta < 1$, it is also convex for
$\eta = 1$.

Let us now check that the Gauss-Kronecker curvature of the $\eta$-exponential
superprediction surface is always positive when $\eta < 1$ and is sometimes negative
when $\eta > 1$ (the rest of the proof, an elaboration of the above argument, will
be easy). Set $n := |\Omega|$; without loss of generality we assume $\Omega = \{1, \ldots, n\}$.

Figure 4: The overround distribution histogram for the tennis data.

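A convenient parametric representation of the $\eta$-exponential superprediction
surface is

$$
\begin{pmatrix} x^1 \\ x^2 \\ \vdots \\ x^{n-1} \\ x^n \end{pmatrix}
=
\begin{pmatrix}
e^{-\eta\left((u^1-1)^2 + (u^2)^2 + \cdots + (u^n)^2\right)} \\
e^{-\eta\left((u^1)^2 + (u^2-1)^2 + \cdots + (u^n)^2\right)} \\
\vdots \\
e^{-\eta\left((u^1)^2 + \cdots + (u^{n-1}-1)^2 + (u^n)^2\right)} \\
e^{-\eta\left((u^1)^2 + \cdots + (u^{n-1})^2 + (u^n-1)^2\right)}
\end{pmatrix},
\tag{5}
$$

where $u^1, \ldots, u^{n-1}$ are the coordinates on the surface, $u^1, \ldots, u^{n-1} \in (0, 1)$
subject to $u^1 + \cdots + u^{n-1} < 1$, and $u^n$ is a shorthand for $1 - u^1 - \cdots - u^{n-1}$. The
derivative of (5) in $u^1$ is

$$
\frac{\partial}{\partial u^1}
\begin{pmatrix} x^1 \\ x^2 \\ \vdots \\ x^{n-1} \\ x^n \end{pmatrix}
= 2\eta
\begin{pmatrix}
(u^n - u^1 + 1)\, e^{-\eta\left((u^1-1)^2 + (u^2)^2 + \cdots + (u^n)^2\right)} \\
(u^n - u^1)\, e^{-\eta\left((u^1)^2 + (u^2-1)^2 + \cdots + (u^n)^2\right)} \\
\vdots \\
(u^n - u^1)\, e^{-\eta\left((u^1)^2 + \cdots + (u^{n-1}-1)^2 + (u^n)^2\right)} \\
(u^n - u^1 - 1)\, e^{-\eta\left((u^1)^2 + \cdots + (u^{n-1})^2 + (u^n-1)^2\right)}
\end{pmatrix}
\propto
\begin{pmatrix}
(u^n - u^1 + 1)\, e^{2\eta u^1} \\
(u^n - u^1)\, e^{2\eta u^2} \\
\vdots \\
(u^n - u^1)\, e^{2\eta u^{n-1}} \\
(u^n - u^1 - 1)\, e^{2\eta u^n}
\end{pmatrix},
$$

the derivative in $u^2$ is

$$
\frac{\partial}{\partial u^2}
\begin{pmatrix} x^1 \\ x^2 \\ \vdots \\ x^{n-1} \\ x^n \end{pmatrix}
\propto
\begin{pmatrix}
(u^n - u^2)\, e^{2\eta u^1} \\
(u^n - u^2 + 1)\, e^{2\eta u^2} \\
\vdots \\
(u^n - u^2)\, e^{2\eta u^{n-1}} \\
(u^n - u^2 - 1)\, e^{2\eta u^n}
\end{pmatrix},
$$

and so on, up to

$$
\frac{\partial}{\partial u^{n-1}}
\begin{pmatrix} x^1 \\ x^2 \\ \vdots \\ x^{n-1} \\ x^n \end{pmatrix}
\propto
\begin{pmatrix}
(u^n - u^{n-1})\, e^{2\eta u^1} \\
(u^n - u^{n-1})\, e^{2\eta u^2} \\
\vdots \\
(u^n - u^{n-1} + 1)\, e^{2\eta u^{n-1}} \\
(u^n - u^{n-1} - 1)\, e^{2\eta u^n}
\end{pmatrix},
$$

all coefficients of proportionality being equal and positive.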
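A normal vector to the surface can be found as

$$
Z :=
\begin{vmatrix}
e_1 & \cdots & e_{n-1} & e_n \\
(u^n - u^1 + 1) e^{2\eta u^1} & \cdots & (u^n - u^1) e^{2\eta u^{n-1}} & (u^n - u^1 - 1) e^{2\eta u^n} \\
\vdots & \ddots & \vdots & \vdots \\
(u^n - u^{n-1}) e^{2\eta u^1} & \cdots & (u^n - u^{n-1} + 1) e^{2\eta u^{n-1}} & (u^n - u^{n-1} - 1) e^{2\eta u^n}
\end{vmatrix},
$$

where $e_i$ is the $i$th vector in the standard basis of $\mathbb{R}^n$. The coefficient in front
of $e_1$ is the $(n-1) \times (n-1)$ determinant

$$
\begin{vmatrix}
(u^n - u^1) e^{2\eta u^2} & \cdots & (u^n - u^1) e^{2\eta u^{n-1}} & (u^n - u^1 - 1) e^{2\eta u^n} \\
(u^n - u^2 + 1) e^{2\eta u^2} & \cdots & (u^n - u^2) e^{2\eta u^{n-1}} & (u^n - u^2 - 1) e^{2\eta u^n} \\
\vdots & \ddots & \vdots & \vdots \\
(u^n - u^{n-1}) e^{2\eta u^2} & \cdots & (u^n - u^{n-1} + 1) e^{2\eta u^{n-1}} & (u^n - u^{n-1} - 1) e^{2\eta u^n}
\end{vmatrix}
$$

$$
\propto e^{-2\eta u^1}
\begin{vmatrix}
u^n - u^1 & \cdots & u^n - u^1 & u^n - u^1 - 1 \\
u^n - u^2 + 1 & \cdots & u^n - u^2 & u^n - u^2 - 1 \\
\vdots & \ddots & \vdots & \vdots \\
u^n - u^{n-1} & \cdots & u^n - u^{n-1} + 1 & u^n - u^{n-1} - 1
\end{vmatrix}
$$

$$
= e^{-2\eta u^1}
\begin{vmatrix}
1 & 1 & \cdots & 1 & u^n - u^1 - 1 \\
2 & 1 & \cdots & 1 & u^n - u^2 - 1 \\
1 & 2 & \cdots & 1 & u^n - u^3 - 1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
1 & 1 & \cdots & 2 & u^n - u^{n-1} - 1
\end{vmatrix}
$$

$$
= e^{-2\eta u^1}
\begin{vmatrix}
1 & 1 & \cdots & 1 & u^n - u^1 - 1 \\
1 & 0 & \cdots & 0 & u^1 - u^2 \\
0 & 1 & \cdots & 0 & u^1 - u^3 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & u^1 - u^{n-1}
\end{vmatrix}
$$

$$
= -e^{-2\eta u^1} (-1)^n n u^1 \propto u^1 e^{-2\eta u^1}
\tag{6}
$$

(with a positive coefficient of proportionality, $e^{2\eta}$, in the first $\propto$; the third
equality follows from the expansion of the determinant along the last column and then
along the first row).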

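Similarly, the coefficient in front of $e_i$ is proportional (with the same
coefficient of proportionality) to $u^i e^{-2\eta u^i}$ for $i = 2, \ldots, n-1$; indeed, the
$(n-1) \times (n-1)$ determinant representing the coefficient in front of $e_i$ can
be reduced to the form analogous to (6) by moving the $i$th row to the top.

The coefficient in front of $e_n$ is proportional to

$$
e^{-2\eta u^n}
\begin{vmatrix}
u^n - u^1 + 1 & u^n - u^1 & \cdots & u^n - u^1 \\
u^n - u^2 & u^n - u^2 + 1 & \cdots & u^n - u^2 \\
\vdots & \vdots & \ddots & \vdots \\
u^n - u^{n-1} & u^n - u^{n-1} & \cdots & u^n - u^{n-1} + 1
\end{vmatrix}
$$

$$
= e^{-2\eta u^n}
\begin{vmatrix}
1 & 0 & \cdots & 0 & u^n - u^1 \\
0 & 1 & \cdots & 0 & u^n - u^2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & u^n - u^{n-2} \\
-1 & -1 & \cdots & -1 & u^n - u^{n-1} + 1
\end{vmatrix}
= e^{-2\eta u^n}
\begin{vmatrix}
1 & 0 & \cdots & 0 & u^n - u^1 \\
0 & 1 & \cdots & 0 & u^n - u^2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & u^n - u^{n-2} \\
0 & 0 & \cdots & 0 & n u^n
\end{vmatrix}
= n u^n e^{-2\eta u^n}
$$

(with the coefficient of proportionality $e^{2\eta} (-1)^{n-1}$).

The Gauss-Kronecker curvature at the point with coordinates $(u^1, \ldots, u^{n-1})$
is proportional (with a positive coefficient of proportionality, possibly depending
on the point) to

$$
\begin{vmatrix}
\partial Z^{\mathsf{T}} / \partial u^1 \\
\vdots \\
\partial Z^{\mathsf{T}} / \partial u^{n-1} \\
Z^{\mathsf{T}}
\end{vmatrix}
\tag{7}
$$

([12], Chapter 12, Theorem 5, with $^{\mathsf{T}}$ standing for transposition).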

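A straightforward calculation allows us to rewrite determinant (7) (ignoring
the positive coefficient $\left((-1)^{n-1} n e^{2\eta}\right)^n$) as

$$
\begin{vmatrix}
(1 - 2\eta u^1) e^{-2\eta u^1} & 0 & \cdots & 0 & (2\eta u^n - 1) e^{-2\eta u^n} \\
0 & (1 - 2\eta u^2) e^{-2\eta u^2} & \cdots & 0 & (2\eta u^n - 1) e^{-2\eta u^n} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & (1 - 2\eta u^{n-1}) e^{-2\eta u^{n-1}} & (2\eta u^n - 1) e^{-2\eta u^n} \\
u^1 e^{-2\eta u^1} & u^2 e^{-2\eta u^2} & \cdots & u^{n-1} e^{-2\eta u^{n-1}} & u^n e^{-2\eta u^n}
\end{vmatrix}
$$

$$
\propto
\begin{vmatrix}
1 - 2\eta u^1 & 0 & \cdots & 0 & 2\eta u^n - 1 \\
0 & 1 - 2\eta u^2 & \cdots & 0 & 2\eta u^n - 1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 - 2\eta u^{n-1} & 2\eta u^n - 1 \\
u^1 & u^2 & \cdots & u^{n-1} & u^n
\end{vmatrix}
$$

$$
\begin{aligned}
= {} & u^1 (1 - 2\eta u^2)(1 - 2\eta u^3) \cdots (1 - 2\eta u^n) \\
& + u^2 (1 - 2\eta u^1)(1 - 2\eta u^3) \cdots (1 - 2\eta u^n) + \cdots \\
& + u^n (1 - 2\eta u^1)(1 - 2\eta u^2) \cdots (1 - 2\eta u^{n-1})
\end{aligned}
\tag{8}
$$

(with a positive coefficient of proportionality; to avoid calculation of the parities
of various permutations, the reader might prefer to prove the last equality by
induction in $n$, expanding the last determinant along the first column). Our
next goal is to show that the last expression in (8) is positive when $\eta < 1$ but
can be negative when $\eta > 1$.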



If $\eta > 1$, set $u^1 = u^2 := 1/2$ and $u^3 = \cdots = u^n := 0$. The last expression in
(8) becomes negative. It will remain negative if $u^1$ and $u^2$ are sufficiently close
to 1/2 and $u^3, \ldots, u^n$ are sufficiently close to 0.
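
The sign behaviour of the last expression in (8) is easy to explore numerically. The sketch below evaluates that expression directly; the function name `curvature_expression` is hypothetical, and the point is passed as the full vector of $n$ coordinates summing to 1, which is an assumption of this illustration.

```python
import math


def curvature_expression(u, eta):
    """Last expression in (8): sum over i of u^i * prod_{j != i} (1 - 2*eta*u^j).

    `u` lists all n coordinates (non-negative, summing to 1).
    """
    t = [1 - 2 * eta * x for x in u]
    return sum(u[i] * math.prod(t[:i] + t[i + 1:]) for i in range(len(u)))


# The counterexample from the text with eta = 1.5 and n = 4.
value = curvature_expression([0.5, 0.5, 0.0, 0.0], 1.5)
```

At the counterexample point only the first two terms survive and each equals $(1 - \eta)/2$, so the expression reduces to $1 - \eta$, which is negative exactly when $\eta > 1$.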

It remains to consider the case $\eta < 1$. Set $t_i := 1 - 2\eta u^i$, $i = 1, \ldots, n$; the
constraints on the $t_i$ are

$$
\begin{aligned}
& -1 < 1 - 2\eta < t_i < 1, \quad i = 1, \ldots, n, \\
& t_1 + \cdots + t_n = n - 2\eta > n - 2.
\end{aligned}
\tag{9}
$$

Our goal is to prove

$$(1 - t_1) t_2 t_3 \cdots t_n + \cdots + (1 - t_n) t_1 t_2 \cdots t_{n-1} > 0,$$

i.e.,

$$t_2 t_3 \cdots t_n + \cdots + t_1 t_2 \cdots t_{n-1} > n t_1 \cdots t_n. \tag{10}$$

This reduces to

$$\frac{1}{t_1} + \cdots + \frac{1}{t_n} > n \tag{11}$$

if $t_1 \cdots t_n > 0$, and to

$$\frac{1}{t_1} + \cdots + \frac{1}{t_n} < n \tag{12}$$

if $t_1 \cdots t_n < 0$. The remaining case is where some of the $t_i$ are zero; for
concreteness, let $t_n = 0$. By (9) we have $t_1 + \cdots + t_{n-1} > n - 2$, and so all of
$t_1, \ldots, t_{n-1}$ are positive; this shows that (10) is indeed true.

Let us prove (11). Since $t_1 \cdots t_n > 0$, all of $t_1, \ldots, t_n$ are positive (if two of
them were negative, the sum $t_1 + \cdots + t_n$ would be less than $n - 2$; cf. (9)).
Therefore,

$$\frac{1}{t_1} + \cdots + \frac{1}{t_n} > \underbrace{1 + \cdots + 1}_{n \text{ times}} = n.$$

To establish (10) it remains to prove (12). Suppose, without loss of
generality, that $t_1 > 0$, $t_2 > 0$, ..., $t_{n-1} > 0$, and $t_n < 0$. We will prove a slightly
stronger statement allowing $t_1, \ldots, t_{n-2}$ to take value 1 and removing the lower
bound on $t_n$. Since the function $t \in (0, 1] \mapsto 1/t$ is convex, we can also assume,
without loss of generality, $t_1 = \cdots = t_{n-2} = 1$. Then $t_{n-1} + t_n > 0$, and so

$$\frac{1}{t_{n-1}} + \frac{1}{t_n} < 0;$$

therefore,

$$\frac{1}{t_1} + \cdots + \frac{1}{t_n} = n - 2 + \frac{1}{t_{n-1}} + \frac{1}{t_n} < n - 2 < n.$$



Finally, let us check that the positivity of the Gauss-Kronecker curvature
implies the convexity of the $\eta$-exponential superprediction set in the case $\eta \le 1$,
and the lack of positivity of the Gauss-Kronecker curvature implies the lack
of convexity of the $\eta$-exponential superprediction set in the case $\eta > 1$. The
$\eta$-exponential superprediction surface will be oriented by choosing the normal
vector field directed towards the origin. This can be done since

$$
\begin{pmatrix} x^1 \\ \vdots \\ x^n \end{pmatrix}
\propto
\begin{pmatrix} e^{2\eta u^1} \\ \vdots \\ e^{2\eta u^n} \end{pmatrix},
\qquad
Z \propto (-1)^{n-1}
\begin{pmatrix} u^1 e^{-2\eta u^1} \\ \vdots \\ u^n e^{-2\eta u^n} \end{pmatrix},
\tag{13}
$$

with both coefficients of proportionality positive (cf. (5) and the bottom row of
the first determinant in (8)), and the sign of the scalar product of the two vectors
on the right-hand sides in (13) does not depend on the point $(u^1, \ldots, u^{n-1})$.
Namely, we take $(-1)^n Z$ as the normal vector field directed towards the origin.
The Gauss-Kronecker curvature will not change sign after the re-orientation:
if $n$ is even, the new orientation coincides with the old, and for odd $n$ the
Gauss-Kronecker curvature does not depend on the orientation.

In the case $\eta > 1$, the Gauss-Kronecker curvature is negative at some point,
and so the $\eta$-exponential superprediction set is not convex ([12], Chapter 13,
Theorem 1 and its proof).

It remains to consider the case $\eta \le 1$. Because of the continuity of the
$\eta$-exponential superprediction surface in $\eta$ we can and will assume, without loss
of generality, that $\eta < 1$.

Let us first check that the smallest principal curvature
$k_1 = k_1(u^1, \ldots, u^{n-1}, \eta)$
of the $\eta$-exponential superprediction surface is always positive (among the
arguments of $k_1$ we list not only the coordinates $u^1, \ldots, u^{n-1}$ of a point on the surface
(5) but also the learning rate $\eta \in (0, 1)$). At least at some $(u^1, \ldots, u^{n-1}, \eta)$ the
value of $k_1(u^1, \ldots, u^{n-1}, \eta)$ is positive: take a sufficiently small $\eta$ and the point
on the surface (5) at which the maximum of $x^1 + \cdots + x^n$ is attained (the point
of the $\eta$-exponential superprediction set at which the maximum is attained will
lie on the surface since the maximum is attained at $(x^1, \ldots, x^n) = (1, \ldots, 1)$
when $\eta = 0$). Therefore, for all $(u^1, \ldots, u^{n-1}, \eta)$ the value of $k_1(u^1, \ldots, u^{n-1}, \eta)$
is positive: if $k_1$ had different signs at two points in the set

$$
\left\{ (u^1, \ldots, u^{n-1}, \eta) \mid u^1 \in (0, 1), \ldots, u^{n-1} \in (0, 1),\;
u^1 + \cdots + u^{n-1} < 1,\; \eta \in (0, 1) \right\},
\tag{14}
$$

we could connect these points by a continuous curve lying completely inside
(14); at some point on the curve, $k_1$ would be zero, in contradiction to the
positivity of the Gauss-Kronecker curvature $k_1 \cdots k_{n-1}$.

Now it is easy to show that the $\eta$-exponential superprediction set is convex.
Suppose there are two points $A$ and $B$ on the $\eta$-exponential superprediction
surface such that the interval $[A, B]$ contains points outside the $\eta$-exponential
superprediction set. The intersection of the plane $OAB$, where $O$ is the origin,
with the $\eta$-exponential superprediction surface is a planar curve; the curvature
of this curve at some point between $A$ and $B$ will be negative (remember that
the curve is oriented by directing the normal vector field towards the origin),
contradicting the positivity of $k_1$ at that point.

5 Derivation of the prediction algorithm 

To achieve the loss bound (1) in Theorem 1 Learner can use, as discussed earlier,
the strong aggregating algorithm (see, e.g., [16], Section 2.1, (15)) with $\eta = 1$.
In this section we will find a substitution function for the strong aggregating
algorithm for the Brier game with $\eta \le 1$, which is the only component of the
algorithm not described explicitly in [16]. Our substitution function will not
require that its input, the generalized prediction, should be computed from the
normalized distribution $(w^k)_{k=1}^K$ on the experts; this is a valuable feature for
generalizations to an infinite number of experts (as demonstrated in, e.g., [16],
Appendix A.1).

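Suppose that we are given a generalized prediction $(l_1, \ldots, l_n)^{\mathsf{T}}$ computed by
the aggregating pseudo-algorithm from a normalized distribution on the experts.
Since $(l_1, \ldots, l_n)^{\mathsf{T}}$ is a superprediction (remember that we are assuming $\eta \le 1$),
we are only required to find a permitted prediction

$$
\begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_n \end{pmatrix}
=
\begin{pmatrix}
(u^1 - 1)^2 + (u^2)^2 + \cdots + (u^n)^2 \\
(u^1)^2 + (u^2 - 1)^2 + \cdots + (u^n)^2 \\
\vdots \\
(u^1)^2 + (u^2)^2 + \cdots + (u^n - 1)^2
\end{pmatrix}
\tag{15}
$$

(cf. (5)) satisfying

$$\lambda_1 \le l_1, \ldots, \lambda_n \le l_n. \tag{16}$$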



Now suppose we are given a generalized prediction $(L_1, \ldots, L_n)^{\mathsf{T}}$ computed
by the aggregating pseudo-algorithm from an unnormalized distribution on the
experts; in other words, we are given

$$
\begin{pmatrix} L_1 \\ \vdots \\ L_n \end{pmatrix}
=
\begin{pmatrix} l_1 + c \\ \vdots \\ l_n + c \end{pmatrix}
$$

for some $c \in \mathbb{R}$. To find (15) satisfying (16) we can first find the largest $t \in \mathbb{R}$
such that $(L_1 - t, \ldots, L_n - t)^{\mathsf{T}}$ is still a superprediction and then find (15)
satisfying

$$\lambda_1 \le L_1 - t, \ldots, \lambda_n \le L_n - t. \tag{17}$$

Since $t \ge c$, it is clear that $(\lambda_1, \ldots, \lambda_n)^{\mathsf{T}}$ will also satisfy the required (16).
Proposition 1. Define $s \in \mathbb{R}$ by the requirement

$$\sum_{i=1}^n (s - L_i)^+ = 2. \tag{18}$$

The unique solution to the optimization problem $t \to \max$ under the constraints
(17) with $\lambda_1, \ldots, \lambda_n$ as in (15) will be

$$u^i = \frac{(s - L_i)^+}{2}, \quad i = 1, \ldots, n, \tag{19}$$

$$t = s - 1 - (u^1)^2 - \cdots - (u^n)^2. \tag{20}$$

There exists a unique $s$ satisfying (18) since the left-hand side of (18) is a
continuous, increasing (strictly increasing when positive) and unbounded above
function of $s$. The substitution function is given by (19).

Proof of Proposition 1. Let us denote the $u^i$ and $t$ defined by (19) and (20) as
$\bar{u}^i$ and $\bar{t}$, respectively. To see that they satisfy the constraints (17), notice that
the $i$th constraint can be spelt out as

$$(u^1)^2 + \cdots + (u^n)^2 - 2u^i + 1 \le L_i - t,$$

which immediately follows from (19) and (20). As a by-product, we can see that
the inequality becomes an equality, i.e.,

$$\bar{t} = L_i - 1 + 2\bar{u}^i - (\bar{u}^1)^2 - \cdots - (\bar{u}^n)^2 \tag{21}$$

for all $i$ with $\bar{u}^i > 0$.

We can rewrite (17) as

$$
\begin{aligned}
t &\le L_1 - 1 + 2u^1 - (u^1)^2 - \cdots - (u^n)^2, \\
&\;\;\vdots \\
t &\le L_n - 1 + 2u^n - (u^1)^2 - \cdots - (u^n)^2,
\end{aligned}
\tag{22}
$$

and our goal is to prove that these inequalities imply $t < \bar{t}$ (unless $u^1 = \bar{u}^1, \ldots, u^n = \bar{u}^n$).
Choose $\bar{u}^i$ (necessarily $\bar{u}^i > 0$ unless $u^1 = \bar{u}^1, \ldots, u^n = \bar{u}^n$;
in the latter case, however, we can, and will, also choose $\bar{u}^i > 0$) for which
$\epsilon_i := \bar{u}^i - u^i$ is maximal. Then every value of $t$ satisfying (22) will also satisfy

$$
\begin{aligned}
t &\le L_i - 1 + 2u^i - \sum_{j=1}^n (u^j)^2 \\
&= L_i - 1 + 2\bar{u}^i - 2\epsilon_i - \sum_{j=1}^n (\bar{u}^j)^2 + 2 \sum_{j=1}^n \epsilon_j \bar{u}^j - \sum_{j=1}^n \epsilon_j^2 \\
&\le L_i - 1 + 2\bar{u}^i - \sum_{j=1}^n (\bar{u}^j)^2 - \sum_{j=1}^n \epsilon_j^2 \le \bar{t},
\end{aligned}
$$

with the last $\le$ following from (21) and becoming $<$ when not all $u^j$ coincide
with $\bar{u}^j$. □

The detailed description of the resulting prediction algorithm was given as
Algorithm 1 in Section 2. As discussed, that algorithm uses the generalized
prediction $G_N(\omega)$ computed from unnormalized weights.
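
The substitution step of Proposition 1 can be sketched without iterative root finding: after sorting the components of the generalized prediction, the solution of the equation for $s$ is found by trying successively larger sets of "active" coordinates. This sort-based solver is a standard technique chosen here for illustration, not a method stated in the paper, and the function name `substitution` is hypothetical.

```python
def substitution(L):
    """Return (s, u, t) for a generalized prediction L = (L_1, ..., L_n).

    s solves sum_i (s - L_i)^+ = 2, u_i = (s - L_i)^+ / 2, and
    t = s - 1 - sum_i u_i^2.
    """
    ordered = sorted(L)
    prefix = 0.0
    s = None
    for m, value in enumerate(ordered, start=1):
        prefix += value
        candidate = (2.0 + prefix) / m  # solves sum over the m smallest of (s - L_i) = 2
        # Accept once the candidate does not exceed the next (inactive) component.
        if m == len(ordered) or candidate <= ordered[m]:
            s = candidate
            break
    u = [max(s - x, 0.0) / 2 for x in L]
    t = s - 1 - sum(v * v for v in u)
    return s, u, t


# The loss vector of the forecast (1/2, 1/4, 1/4) is mapped back to that forecast.
s, u, t = substitution([0.375, 0.875, 0.875])
```

Adding the same constant to every component of the input shifts $s$ and $t$ by that constant and leaves the forecast unchanged, which is why unnormalized weights can be used.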

6 Conclusion 

In this paper we only considered the simplest prediction problem for the Brier 
game: competing with a finite pool of experts. In the case of square-loss regres- 
sion, it is possible to find efficient closed-form prediction algorithms competitive 
with linear functions (see, e.g., [3], Chapter 11). Such algorithms can often be
"kernelized" to obtain prediction algorithms competitive with reproducing
kernel Hilbert spaces of prediction rules. This would be an appealing research
programme in the case of the Brier game as well.

A Watkins's theorem

Watkins's theorem is stated in [15] (Theorem 8) not in sufficient generality: it
presupposes that the loss function is perfectly mixable. The proof, however,
shows that this assumption is irrelevant (it can be made part of the conclusion),
and the goal of this appendix is to give a self-contained statement of a suitable
version of the theorem.

In this appendix we will use a slightly more general notion of a game of
prediction $(\Omega, \Gamma, \lambda)$: namely, the loss function $\lambda : \Omega \times \Gamma \to \overline{\mathbb{R}}$ is now allowed
to take values in the extended real line $\overline{\mathbb{R}} := \mathbb{R} \cup \{-\infty, \infty\}$ (although the value
$-\infty$ will be later disallowed).

Partly following [14], for each $K = 1, 2, \ldots$ and each $a \ge 0$ we consider
the following perfect-information game $\mathcal{G}_K(a)$ (the "global game") between two
players, Learner and Environment. Environment is a team of $K + 1$ players
called Expert 1 to Expert $K$ and Reality, who play with Learner according to
Protocol 1. Learner wins if, for all $N = 1, 2, \ldots$ and all $k \in \{1, \ldots, K\}$,

$$L_N \le L_N^k + a; \tag{23}$$

otherwise, Environment wins. It is possible that $L_N = \infty$ or $L_N^k = \infty$ in (23);
the interpretation of inequalities involving infinities is natural.

For each $K$ we will be interested in the set of those $a \ge 0$ for which Learner
has a winning strategy in the game $\mathcal{G}_K(a)$ (we will denote this by $\mathrm{L} \smile \mathcal{G}_K(a)$).
It is obvious that

$$\mathrm{L} \smile \mathcal{G}_K(a) \;\&\; a' > a \implies \mathrm{L} \smile \mathcal{G}_K(a');$$

therefore, for each $K$ there exists a unique borderline value $a_K$ such that
$\mathrm{L} \smile \mathcal{G}_K(a)$ holds when $a > a_K$ and fails when $a < a_K$. It is possible that $a_K = \infty$
(but remember that we are only interested in finite values of $a$).

These are our assumptions about the game of prediction (similar to those in 
[14]): 

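• $\Gamma$ is a compact topological space;

• for each $\omega \in \Omega$, the function $\gamma \in \Gamma \mapsto \lambda(\omega, \gamma)$ is continuous ($\overline{\mathbb{R}}$ is equipped
with the standard topology);

• there exists $\gamma \in \Gamma$ such that, for all $\omega \in \Omega$, $\lambda(\omega, \gamma) < \infty$;

• the function $\lambda$ is bounded below.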

We say that the game of prediction $(\Omega, \Gamma, \lambda)$ is $\eta$-mixable, where $\eta > 0$, if

$$
\begin{gathered}
\forall \gamma_1 \in \Gamma, \gamma_2 \in \Gamma, \alpha \in [0, 1] \;\; \exists \delta \in \Gamma \;\; \forall \omega \in \Omega : \\
e^{-\eta \lambda(\omega, \delta)} \ge \alpha e^{-\eta \lambda(\omega, \gamma_1)} + (1 - \alpha) e^{-\eta \lambda(\omega, \gamma_2)}.
\end{gathered}
\tag{24}
$$

In the case of finite $\Omega$, this condition says that the image of the superprediction
set under the mapping $\Phi_\eta$ (see (4)) is convex. The game of prediction is perfectly
mixable if it is $\eta$-mixable for some $\eta > 0$.

It follows from [7] (Theorem 92, applied to the means $\mathfrak{M}_\phi$ with $\phi(x) = e^{-\eta x}$)
that if the prediction game is $\eta$-mixable it will remain $\eta'$-mixable for any positive
$\eta' < \eta$. (For another proof, see the end of the proof of Lemma 9 in [14].) Let
$\eta^*$ be the supremum of the $\eta$ for which the prediction game is $\eta$-mixable (with
$\eta^* := 0$ when the game is not perfectly mixable). The compactness of $\Gamma$ implies
that the prediction game is $\eta^*$-mixable.

Theorem 2 (Chris Watkins). For any $K \in \{1, 2, \ldots\}$,

$$a_K = \frac{\ln K}{\eta^*}.$$

In particular, $a_K < \infty$ if and only if the game is perfectly mixable.

The theorem does not say explicitly, but it is easy to check, that
$\mathrm{L} \smile \mathcal{G}_K(a_K)$: this follows both from general considerations (cf. Lemma 3 in [14])
and from the fact that the SAA wins $\mathcal{G}_K(a_K) = \mathcal{G}_K(\ln K / \eta^*)$.

Proof of Theorem 2. The proof will use some notions and notation used in the
statement and proof of Theorem 1 of [14]. Without loss of generality we can,
and will, assume that the loss function satisfies $\lambda \ge 1$ (add a suitable constant
to $\lambda$ if needed). Therefore, Assumption 4 of [14] (the only assumption in [14]
not directly made in this paper) is satisfied. In view of the fact that
$\mathrm{L} \smile \mathcal{G}_K(\ln K / \eta^*)$, we only need to show that $\mathrm{L} \smile \mathcal{G}_K(a)$ does not hold for
$a < \ln K / \eta^*$. Fix $a < \ln K / \eta^*$.

The separation curve, as defined in [14], consists of the points $(c(\beta), c(\beta)/\eta) \in [0, \infty)^2$,
where $\beta := e^{-\eta}$ and $\eta$ ranges over $[0, \infty]$ (see [14], Theorem 1). Since
the two-fold convex mixture in (24) can be replaced by any finite convex mixture
(apply two-fold mixtures repeatedly), setting $\eta := \eta^*$ shows that the point
$(1, 1/\eta^*)$ is Northeast of (actually belongs to) the separation curve. On the
other hand, the point $(1, a / \ln K)$ is Southwest and outside of the separation
curve (use Lemmas 8-12 of [14]). Therefore, E (= Environment) has a winning
strategy in the game $\mathcal{G}(1, a / \ln K)$, as defined in [14]. It is easy to see from
the proof of Theorem 1 in [14] that the definition of the game $\mathcal{G}$ in [14] can be
modified, without changing the conclusion about $\mathcal{G}(1, a / \ln K)$, by replacing the
line

  E chooses $n \ge 1$ {size of the pool}

in the protocol on p. 153 of [14] by

  E chooses $n^* \ge 1$ {lower bound on the size of the pool}
  L chooses $n \ge n^*$ {size of the pool}

(indeed, the proof in Section 6 of [14] only requires that there should be
sufficiently many experts). Let $n^*$ be the first move by Environment according to
her winning strategy.

Now suppose $\mathrm{L} \smile \mathcal{G}_K(a)$. From the fact that there exists Learner's strategy
$\mathcal{L}_1$ winning $\mathcal{G}_K(a)$ we can deduce: there exists Learner's strategy $\mathcal{L}_2$ winning
$\mathcal{G}_{K^2}(2a)$ (we can split the $K^2$ experts into $K$ groups of $K$, merge the experts'
decisions in each group with $\mathcal{L}_1$, and finally merge the groups' decisions with
$\mathcal{L}_1$); there exists Learner's strategy $\mathcal{L}_3$ winning $\mathcal{G}_{K^3}(3a)$ (we can split the $K^3$
experts into $K$ groups of $K^2$, merge the experts' decisions in each group with $\mathcal{L}_2$,
and finally merge the groups' decisions with $\mathcal{L}_1$); and so on. When the number
$K^m$ of experts exceeds $n^*$, we obtain a contradiction: Learner can guarantee

$$L_N \le L_N^k + ma$$

for all $N$ and all experts $k$, and Environment can guarantee that

$$L_N > L_N^k + \frac{a}{\ln K} \ln(K^m) = L_N^k + ma$$

for some $N$ and $k$. □

B Comparison with other prediction algorithms 

Other popular algorithms for prediction with expert advice that could be used 
instead of Algorithm 1 in our empirical studies reported in Section 3 are, among 
others, Kivinen and Warmuth's [10] Weighted Average Algorithm (WdAA), 
Kalnishkan and Vyugin's [9] Weak Aggregating Algorithm (WkAA), and Freund
and Schapire's [6] Hedge algorithm (HA). In this appendix we consider these 
three algorithms and three more naive algorithms (which, nevertheless, perform 
surprisingly well). 
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Figure 5: The difference between the cumulative loss of each of the 8 bookmakers 
and of the Weighted Average Algorithm (WdAA) on the football data. The 
chosen value of the parameter $c = 1/\eta$ for the WdAA, $c := 16/3$, minimizes
its theoretical loss bound. The theoretical lower bound $-\ln 8 \approx -2.0794$
for Algorithm 1 is also shown (the theoretical lower bound for the Weighted
Average Algorithm, $-11.0904$, can be extracted from Table 1 below).



The Weighted Average Algorithm is very similar to the Strong Aggregating 
Algorithm (SAA) used in this paper: the WdAA maintains the same weights 
for the experts as the SAA, and the only difference is that the WdAA merges 
the experts' predictions by averaging them according to their weights, whereas 
the SAA uses a more complicated "minimax optimal" merging scheme (given by 
(19) for the Brier game). The performance guarantee for the WdAA applied to 
the Brier game is weaker than the optimal (1), but of course this does not mean 
that its empirical performance is necessarily worse than that of the SAA (i.e., 
Algorithm 1). Figures 5 and 6 show the performance of this algorithm, in the 
same format as before (see Figures 1 and 3). We can see that for the football 
data the maximal difference between the cumulative loss of the WdAA and the 
cumulative loss of the best expert is larger than for Algorithm 1 but still well
within the optimal bound $\ln K$ given by (1). For the tennis data the maximal
difference is about twice as large as for Algorithm 1, violating the optimal bound
$\ln K$.
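
The relationship between the two algorithms can be made concrete with a minimal Python sketch. It only illustrates the description above: the experts' weights are updated exponentially in their Brier losses, with learning rate $\eta = 1/c$, and the WdAA merges predictions by weighted averaging. The function names (`brier_loss`, `update_weights`, `wdaa_predict`) are hypothetical, and the SAA's substitution function (19) is deliberately not reproduced here.

```python
import math


def brier_loss(pred, outcome):
    """Squared Euclidean distance between a probability vector and the
    indicator vector of the observed outcome (an index into pred)."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(pred))


def update_weights(weights, expert_preds, outcome, eta):
    """Exponential weight update; by the text, the WdAA and the SAA
    maintain the same weights."""
    return [w * math.exp(-eta * brier_loss(g, outcome))
            for w, g in zip(weights, expert_preds)]


def wdaa_predict(weights, expert_preds):
    """WdAA merging rule: average the experts' predictions by weight."""
    total = sum(weights)
    n_outcomes = len(expert_preds[0])
    return [sum(w * g[i] for w, g in zip(weights, expert_preds)) / total
            for i in range(n_outcomes)]
```

With `c = 16/3` one would call `update_weights(..., eta=3/16)`; with `c = 1` the weights coincide with those of Algorithm 1 and only the merging step differs.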

Figure 6: The difference between the cumulative loss of each of the 4 bookmakers
and of the WdAA for $c := 4$ on the tennis data. (Plot not reproduced; legend:
theoretical bound for Algorithm 1, Weighted Average Algorithm, experts.)

In its most basic form ([10], the beginning of Section 6), the WdAA works
in the following protocol. At each step each expert, Learner, and Reality choose
an element of the unit ball in $\mathbb{R}^n$, and the loss function is the
squared distance between the decision (Learner's or an expert's move) and the
observation (Reality's move). This covers the Brier game with
$\Omega = \{1, \ldots, n\}$, each observation $\omega \in \Omega$ represented as
the vector $(\delta_\omega\{1\}, \ldots, \delta_\omega\{n\})$, and each decision
$\gamma \in \mathcal{P}(\Omega)$ represented as the vector
$(\gamma\{1\}, \ldots, \gamma\{n\})$. However, in the Brier game the decision
makers' moves are known to belong to the simplex
$\{(u^1, \ldots, u^n) \in [0, \infty)^n \mid \sum_{i=1}^n u^i = 1\}$, and
Reality's move is known to be one of the vertices of this simplex. Therefore, we
can optimize the ball radius by considering the smallest ball containing the
simplex rather than the unit ball. This is what we did for the results reported
here (although the results reported in the conference version of this paper [17]
are for the WdAA applied to the unit cube in $\mathbb{R}^n$). The radius of the
smallest ball is

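$$R := \sqrt{1 - \frac{1}{n}} \approx \begin{cases} 0.8165 & \text{if } n = 3 \\ 0.7071 & \text{if } n = 2 \\ 1 & \text{if } n \text{ is large.} \end{cases}$$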
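
The quoted radii can be checked numerically. The sketch below assumes that the smallest ball containing the simplex is centred at its centroid $(1/n, \ldots, 1/n)$, which holds for a regular simplex, so the radius is the distance from the centroid to any vertex, i.e. $\sqrt{1 - 1/n}$. The function name `simplex_radius` is hypothetical.

```python
import math


def simplex_radius(n):
    """Distance from the centroid of the probability simplex in R^n
    to one of its vertices (assumed centre of the smallest ball)."""
    centroid = [1.0 / n] * n
    vertex = [1.0] + [0.0] * (n - 1)
    return math.dist(centroid, vertex)


# Values to compare with the ones quoted in the text.
radius_three_outcomes = simplex_radius(3)
radius_two_outcomes = simplex_radius(2)
```

For large `n` the value approaches 1, which matches the last case listed above.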

As described in [10], the WdAA is parameterized by $c := 1/\eta$ instead of
$\eta$, and the optimal value of $c$ is $c = 8R^2$, leading to the guaranteed
loss bound

$$L_N \le \min_{k=1,\ldots,K} L_N^k + 8R^2 \ln K$$

for all $N = 1, 2, \ldots$ (see [10], Section 6). This is significantly looser
than the bound (1) for Algorithm 1.

Figure 7: The maximal difference (25) for the WdAA as function of the parameter
$c$ on the football data. The theoretical guarantee $\ln 8$ for the maximal
difference for Algorithm 1 is also shown (the theoretical guarantee for the WdAA,
11.0904, is given in Table 1).
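
As a quick arithmetic check of the parameter values and guarantees quoted for the WdAA, the following sketch evaluates $c = 8R^2$ and $8R^2 \ln K$. It assumes $R^2 = 1 - 1/n$ (the squared distance from the simplex centroid to a vertex, which reproduces the radii 0.8165 and 0.7071), with $n = 3$ outcomes and $K = 8$ experts for football and $n = 2$, $K = 4$ for tennis. The function name `wdaa_guarantee` is hypothetical.

```python
import math


def wdaa_guarantee(n_outcomes, n_experts):
    """Return (c, bound) with c = 8 R^2 and bound = 8 R^2 ln K,
    assuming R^2 = 1 - 1/n for the Brier game with n outcomes."""
    r_squared = 1.0 - 1.0 / n_outcomes
    c = 8.0 * r_squared
    return c, c * math.log(n_experts)


football_c, football_bound = wdaa_guarantee(3, 8)
tennis_c, tennis_bound = wdaa_guarantee(2, 4)
```

Under these assumptions the results agree with the values $c = 16/3$ and $c = 4$ and with the bounds 11.0904 and 5.5452 reported for the two data sets, against $\ln 8 \approx 2.0794$ and $\ln 4 \approx 1.3863$ for Algorithm 1.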



The values c = 16/3 and c = 4 used in Figures 5 and 6, respectively, are 
obtained by minimizing the WdAA's performance guarantee, but minimizing 
a loose bound might not be such a good idea. Figure 7 shows the maximal 
difference 



$$\max_{N=1,\ldots,6473} \left( L_N(c) - \min_{k=1,\ldots,8} L_N^k \right) \tag{25}$$



where $L_N(c)$ is the loss of the WdAA with parameter $c$ on the football data
over the first $N$ steps and $L_N^k$ is the analogous loss of the $k$th expert,
as a function of $c$. Similarly, Figure 8 shows the maximal difference



$$\max_{N=1,\ldots,10087} \left( L_N(c) - \min_{k=1,\ldots,4} L_N^k \right) \tag{26}$$



for the tennis data. And indeed, in both cases the value of c minimizing the 
empirical loss is far from the value minimizing the bound; as could be expected, 
the empirical optimal value for the WdAA is not so different from the optimal 
value for Algorithm 1. The following two figures, 9 and 10, demonstrate that 
there is no such anomaly for Algorithm 1. 
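
The maximal differences (25) and (26) are straightforward to compute from per-step losses. The following is a minimal sketch, assuming the losses are available as plain Python lists; the function name `maximal_difference` and the input layout are illustrative choices, not taken from the paper.

```python
from itertools import accumulate


def maximal_difference(learner_losses, expert_losses):
    """Maximum over N of (cumulative learner loss up to step N minus the
    smallest cumulative expert loss up to step N).

    learner_losses: list of per-step losses of the algorithm.
    expert_losses:  list with one list of per-step losses per expert.
    """
    cum_learner = list(accumulate(learner_losses))
    cum_experts = [list(accumulate(losses)) for losses in expert_losses]
    return max(cum_learner[t] - min(ce[t] for ce in cum_experts)
               for t in range(len(cum_learner)))
```

Plotting this quantity against the parameter ($c$ for the WdAA, $\eta$ for Algorithm 1) would give curves of the kind shown in Figures 7 to 10.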

Figures 11 and 12 show the behaviour of the WdAA for the value of parameter
$c = 1$, i.e., $\eta = 1$, that is optimal for Algorithm 1. They look remarkably
similar to Figures 1 and 3, respectively.

Figure 8: The maximal difference (26) for the WdAA as function of the parameter
$c$ on the tennis data. The theoretical bound for the WdAA is 5.5452 (see
Table 2).

Figure 9: The maximal difference ((25) with $\eta$ in place of $c$) for
Algorithm 1 as function of the parameter $\eta$ on the football data.

Figure 10: The maximal difference ((26) with $\eta$ in place of $c$) for
Algorithm 1 as function of the parameter $\eta$ on the tennis data.

Figure 11: The difference between the cumulative loss of each of the 8
bookmakers and of the WdAA on the football data for $c = 1$ (the value of
parameter minimizing the theoretical performance guarantee for Algorithm 1).

The following two algorithms, the Weak Aggregating Algorithm (WkAA) 
and the Hedge algorithm (HA), make increasingly weaker assumptions about 
the prediction game being played. Algorithm 1 computes the experts' weights 
taking full account of the degree of convexity of the loss function and uses a 
minimax optimal substitution function. Not surprisingly, it leads to the optimal 
loss bound of the form (2). The WdAA computes the experts' weights in the 
same way, but uses a suboptimal substitution function; this naturally leads to 
a suboptimal loss bound. The WkAA "does not know" that the loss function is 
strictly convex; it computes the experts' weights in a way that leads to decent 
results for all convex functions. The WkAA uses the same substitution function 
as the WdAA, but this appears less important than the way it computes the 
weights. The HA "knows" even less: it does not even know that its and the 
experts' performance is measured using a loss function. At each step the HA 
decides which expert it is going to follow, and at the end of the step it is only 
told the losses suffered by all experts. Therefore, it is not surprising that the 
WkAA does not perform as well as Algorithm 1 and the WdAA with c = 1; 
the performance of the HA is even weaker: see Figures 13-16. The HA is a 
randomized algorithm, so we show the expected performance. 
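
To illustrate what "expected performance" means for the HA, here is a minimal sketch of the standard Hedge update of [6]: each expert's weight is multiplied by $\beta$ raised to its loss, the algorithm follows an expert drawn from the normalized weights, and its expected loss at a step is the weighted average of the experts' losses. The function name `hedge_expected_losses` is hypothetical, and any rescaling of the Brier losses to the range assumed in [6] is left out, since the text does not say how it was handled.

```python
def hedge_expected_losses(loss_vectors, beta):
    """Per-step expected loss of Hedge with parameter beta in (0, 1).

    loss_vectors: list of per-step lists, one loss per expert.
    """
    n_experts = len(loss_vectors[0])
    weights = [1.0] * n_experts  # uniform initial weights
    expected = []
    for losses in loss_vectors:
        total = sum(weights)
        probs = [w / total for w in weights]
        # Expected loss of following an expert drawn from probs.
        expected.append(sum(p * l for p, l in zip(probs, losses)))
        # Multiplicative update: weights shrink with the loss suffered.
        weights = [w * beta ** l for w, l in zip(weights, losses)]
    return expected
```

Note that the update uses only the experts' losses, not their predictions, which is the sense in which the HA "knows" less than the other algorithms.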

Figures 13-16 show the performance of the WkAA and the HA for all possible
values of their parameters ($c$ and $\beta$, respectively). We do not show the
optimal values of parameters since neither algorithm satisfies a loss bound of
the form (2) (typical loss bounds for these algorithms allow A to depend on N,
and the optimal value would also depend on N).

Figure 12: The difference between the cumulative loss of each of the 4
bookmakers and of the WdAA for $c = 1$ on the tennis data.

Figure 13: The maximal difference for the Weak Aggregating Algorithm
(WkAA) as function of $c$ on the football data.

Figure 14: The maximal difference for the WkAA as function of $c$ on the tennis
data.

Figure 15: The expected maximal difference for the Hedge algorithm (HA) and
for the SAA Hedge algorithm (SAA-HA) as a function of $\beta$ on the football
data.

In the case of the HA, the loss bound given in the original paper [6] was
replaced, in the same framework, by a stronger bound in [14] (Example 7). The
stronger bound is achieved by the SAA applied to the HA framework described
above (with no loss function); this algorithm is referred to as SAA-HA in the
captions. The description of the SAA-HA given in [14] admits some freedom in
the choice of Learner's decision; our implementation replaces the HA's weights
$p^k$, $k = 1, \ldots, K$, with

$$\frac{\ln(1 + (\beta - 1) p^k)}{\sum_{j=1}^{K} \ln(1 + (\beta - 1) p^j)}.$$

The losses suffered by the HA and the SAA-HA are very close. 
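
The reweighting used for the SAA-HA can be sketched as follows, reading the displayed ratio as $\ln(1 + (\beta - 1)p^k)$ normalized by its sum over the experts. This reading, the function name `saa_ha_weights`, and the restriction to $0 < \beta < 1$ (so that every logarithm is finite) are assumptions of this example.

```python
import math


def saa_ha_weights(p, beta):
    """Map normalized HA weights p to SAA-HA weights.

    Each logarithm is non-positive for 0 < beta < 1, so the ratios
    are non-negative and sum to one.
    """
    logs = [math.log(1.0 + (beta - 1.0) * pk) for pk in p]
    total = sum(logs)
    return [x / total for x in logs]
```

For $\beta$ close to 1 the logarithm is nearly linear in $p^k$, so the new weights stay close to the original ones, which is consistent with the two algorithms suffering very similar losses.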

An interesting observation is that, for both football and tennis data, the loss
of the HA is almost minimized by setting its parameter $\beta$ to 0 (the
qualification "almost" is necessary in the case of the tennis data as well: the
lines of maximal difference in Figure 16 are not monotonic for $\beta$ extremely
close to 0). The HA with $\beta = 0$ coincides with the Follow the Leader
Algorithm (FLA), which
chooses the same decision as the best (with the smallest loss up to now) expert; 
if there are several best experts (which almost never happens after the first 
step), their predictions are averaged with equal weights. Standard examples 
(see, e.g., [3], Section 4.3) show that this algorithm (unlike its version Follow 
the Perturbed Leader) can fail badly on some data sequences. However, its 
empirical performance (Figures 17 and 18) on our data sets is not so bad: it 
violates the loss bounds for Algorithm 1 only slightly. 
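
The Follow the Leader rule described above is simple enough to state in a few lines. This is a minimal sketch; the function name `follow_the_leader` and the optional tie tolerance `tol` are illustrative additions.

```python
def follow_the_leader(cumulative_losses, expert_preds, tol=0.0):
    """Return the prediction of the expert with the smallest loss so far;
    if several experts are tied, average their predictions equally."""
    best = min(cumulative_losses)
    leaders = [g for loss, g in zip(cumulative_losses, expert_preds)
               if loss - best <= tol]
    n_outcomes = len(leaders[0])
    return [sum(g[i] for g in leaders) / len(leaders)
            for i in range(n_outcomes)]
```

At the first step all cumulative losses are zero, so the rule reduces to the equal-weight average of all experts.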

The decent performance of the Follow the Leader Algorithm suggests check- 
ing the empirical performance of other similarly naive algorithms. The Simple 
Average Algorithm's decision is defined as the arithmetic mean of the experts' 
decisions (with equal weights). Figures 19 and 20 show the performance of this 
algorithm. It does violate the theoretical loss bound for Algorithm 1, but not 
significantly (especially in the case of football data). 

The last naive algorithm that we consider is in fact optimal, but for a dif- 
ferent loss function. The Bayes Mixture Algorithm (BMA) is the Strong Ag- 
gregating Algorithm applied to the log loss function. This algorithm has a very 
simple description [13], and was studied from the point of view of prediction
with expert advice already in [5]. Figures 21 and 22 show the performance of 
the BMA measured by the Brier loss function, as usual. The performance is 
excellent for the football data but much weaker for tennis. 
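
For comparison with the other merging schemes, a minimal sketch of the Bayes mixture is given below: the prediction is the weighted average of the experts' probability forecasts, and each weight is multiplied by the probability that the expert assigned to the observed outcome (the exponential update for log loss with learning rate 1). The function names `bma_predict` and `bma_update` are hypothetical.

```python
def bma_predict(weights, expert_preds):
    """Bayes mixture: weighted average of the experts' forecasts."""
    total = sum(weights)
    n_outcomes = len(expert_preds[0])
    return [sum(w * g[i] for w, g in zip(weights, expert_preds)) / total
            for i in range(n_outcomes)]


def bma_update(weights, expert_preds, outcome):
    """Multiply each weight by the probability the expert gave to the
    observed outcome (an index), i.e. by exp(-log loss)."""
    return [w * g[outcome] for w, g in zip(weights, expert_preds)]
```

The weights here are driven by log loss, while the reported performance is measured by Brier loss, which is why no guarantee of the form (1) applies to this algorithm.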

Despite the decent performance of the three naive algorithms on our two 
data sets, there is always a danger of catastrophic performance on some data 
set: there are no performance guarantees for these algorithms whatsoever. It 
is an important advantage of more sophisticated algorithms that they establish 
some upper bound on the algorithm's regret. 

Figure 16: The expected maximal difference for the HA and for the SAA-HA
as a function of $\beta$ on the tennis data.

Figure 17: The difference between the cumulative loss of each of the 8
bookmakers and of the Follow the Leader Algorithm on the football data.

Figure 18: The difference between the cumulative loss of each of the 4
bookmakers and of the Follow the Leader Algorithm on the tennis data.

Figure 19: The difference between the cumulative loss of each of the 8
bookmakers and of the Simple Average Algorithm on the football data.

Figure 20: The difference between the cumulative loss of each of the 4
bookmakers and of the Simple Average Algorithm on the tennis data.

Figure 21: The difference between the cumulative loss of each of the 8
bookmakers and of the Bayes Mixture Algorithm on the football data.

Figure 22: The difference between the cumulative loss of each of the 4
bookmakers and of the Bayes Mixture Algorithm on the tennis data.

Precise numbers associated with the figures referred to above are given in
Tables 1 and 2: the second column gives the maximal differences (25) and
(26), respectively. The numbers preceded by ">" are the maximal differences
corresponding to the best value of parameter chosen in hindsight, after seeing
the data set. Therefore, the corresponding numbers involve "data snooping"
and cannot serve as a fair measure of performance. The third column gives the
theoretical performance guarantees (if available).






Algorithm                      Maximal difference   Theoretical bound
Algorithm 1                    1.1562               2.0794
WdAA (c = 16/3)                1.6619               11.0904
WdAA (c = 1)                   1.1281               none of the form (2)
WkAA                           > 1.8933             none of the form (2)
HA (expected)                  > 2.3694             none of the form (2)
SAA-HA (expected)              > 2.3882             none of the form (2)
Follow the Leader Algorithm    2.7983               none
Simple Average Algorithm       2.5422               none
Bayes Mixture Algorithm        1.0602               none



Table 1: The maximal difference between the loss of each algorithm and the loss 
of the best expert for the football data (second column); the theoretical upper 
bound on this difference (third column). 



Algorithm                      Maximal difference   Theoretical bound
Algorithm 1                    1.2021               1.3863
WdAA (c = 4)                   2.4450               5.5452
WdAA (c = 1)                   1.1089               none of the form (2)
WkAA                           > 1.5059             none of the form (2)
HA (expected)                  > 1.4153             none of the form (2)
SAA-HA (expected)              > 1.3909             none of the form (2)
Follow the Leader Algorithm    1.5597               none
Simple Average Algorithm       3.7928               none
Bayes Mixture Algorithm        4.6531               none



Table 2: The maximal difference between the loss of each algorithm and the loss 
of the best expert for the tennis data (second column); the theoretical upper 
bound on this difference (third column). 
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